Abstract

Deep learning can achieve outstanding results in various fields. However,
it requires so much computational power that graphics processing units
(GPUs) and/or numerous computers are often required for practical
application. We have developed a new distributed calculation framework
called "Sashimi" that allows any computer to be used as a distribution node
simply by accessing a website. We have also developed a new JavaScript
neural network framework called "Sukiyaki" that uses general purpose GPUs
with web browsers. Sukiyaki learns deep convolutional neural networks (deep
CNNs) 30 times faster than a conventional JavaScript library. The
combination of Sashimi and Sukiyaki, together with new distribution
algorithms, demonstrates that deep CNNs can be trained in a distributed
manner using only web browsers on various devices. The libraries that
comprise the proposed methods are available under the MIT license at
http://mil-tokyo.github.io/.

1. Introduction 

Utilization of big data has recently come to play an increasingly important
role in various business fields. Big data is often handled with deep
learning algorithms that can achieve outstanding results in various fields.
For example, almost all of the teams that participated in the ILSVRC 2014
image recognition competition (Russakovsky et al., 2014) used deep learning
algorithms. Such algorithms are also used for speech recognition (Dahl et
al., 2012; Hinton et al., 2012) and molecular activity prediction (Kaggle,
Inc.).

However, deep learning algorithms have huge computational complexity and
often require the use of graphics processing units (GPUs) and numerous
computers for practical distributed computation. It is difficult to
construct a distributed computation environment because it frequently
requires the preparation of certain operating systems and the installation
of specific software, e.g., Hadoop (Shvachko et al., 2010). For practical
application of deep learning algorithms, a distributed calculation
environment that can be set up instantly is eagerly anticipated.

2. Sashimi: Distributed Calculation Framework

We have developed a new distributed calculation framework called "Sashimi."
In a general distributed calculation framework, it is difficult to increase
the number of node computers because users must install client software on
each node computer. With Sashimi, any computer can become a node computer
simply by accessing a certain website via a web browser, without installing
any client software. The proposed system can execute any code written in
JavaScript in a distributed manner.

2.1. Implementation 

Sashimi consists of two servers, the CalculationFramework and Distributor
servers (Figure 1). When a user runs a project that includes distributed
processing using the CalculationFramework and accesses the Distributor via
web browsers, the processes are distributed and executed in multiple web
browsers. Through the CalculationFramework, results computed on the
distributed machines can be used as if they had been computed on the local
machine; the user does not need to be aware of the difference.

2.1.1. CalculationFramework 

CalculationFramework describes calculations that include distributed
processing. If a user writes code according to a certain interface, the
processes are distributed and computed in multiple web browsers via
Distributor. The results computed in the web browsers are then
automatically collected. This allows the results to be used as if they were
processed by the local machine. In this subsection, we introduce
PrimeListMakerProject, which finds prime numbers from 1 to 10,000, as an
example.

Figure 1. Sukiyaki Architecture

A "project" is a programming unit of CalculationFramework that has an
endpoint from which a process starts. In a project, a user can execute
distributed processing by creating a task instance. Note that processes
that do not require distributed processing are also supported.

A "task" is a process that is distributed and executed in web browsers. If
a user writes a task according to a certain interface, the arguments are
automatically divided and distributed to web browsers. The processed
results are then automatically collected. In PrimeListMakerProject, the
task that determines whether an input integer is a prime number is called
IsPrimeTask. This task is distributed among web browsers. The task is given
arguments generated by the project, and it must return the calculated
results using a callback function. Note that the user can use external
libraries and datasets. In this example, the task calls the is_prime
function in an external library, which determines whether the input integer
is a prime number.

After the project generates the task instance with arguments, the framework
generates a ticket for each divided argument. The framework sends the code
and the arguments as tickets, together with the external libraries and
datasets, to the Distributor via MySQL. The results of the tickets
distributed by the Distributor are collected and passed back to the
CalculationFramework via MySQL. Since the project is implemented with
server-side JavaScript (Node.js) and the task is implemented with
JavaScript, they can be written without considering whether the code is
executed by the server or the browsers.

2.1.2. Distributor 

The Distributor distributes tasks and tickets, which are sent from the
CalculationFramework via MySQL, to browsers. The Distributor also collects
the results calculated in the browsers. Note that the Distributor consists
of two servers, i.e., HTTPServer and TicketDistributor.

HTTPServer, which is a web server implemented with Node.js, provides static
files that include a basic program and exposes APIs that offer datasets to
be used in the distributed calculation. If a user wants to make a browser
function as a node, the user only needs to access the basic program
provided by HTTPServer via the web browser. The basic program consists of a
static HTML file and a JavaScript file. The basic program works as follows.

1. A connection with TicketDistributor is established using WebSocket.

2. A ticket request is sent to TicketDistributor.

3. A task request is sent to TicketDistributor if the browser has not yet
downloaded the task described in the ticket.

4. A request for the required external datasets and files is sent to
HTTPServer.

5. The task is executed with the arguments described in the ticket.

6. The calculated result is returned to TicketDistributor.

7. The program returns to Step 2.
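
The loop in Steps 2 to 7 can be sketched as follows. This is a minimal, hypothetical illustration and not Sashimi's actual client: the MockDistributor class, the runClient function, and the SquareTask example are assumptions that replace the WebSocket connection to TicketDistributor with in-memory calls, and Steps 3 and 4 are reduced to a lookup in a local task cache.

```javascript
// Hypothetical in-memory stand-in for TicketDistributor (no real WebSocket).
class MockDistributor {
  constructor(tasks, tickets) {
    this.tasks = tasks;     // task name -> function (input, output)
    this.tickets = tickets; // queue of { id, taskName, input }
    this.results = {};      // ticket id -> returned result
  }
  requestTicket() { return this.tickets.shift() || null; } // Step 2
  requestTask(name) { return this.tasks[name]; }           // Step 3
  returnResult(id, result) { this.results[id] = result; }  // Step 6
}

// Steps 2 to 7 of the basic program. This sketch stops when no ticket is
// left, whereas a real client would keep waiting for new tickets.
function runClient(distributor) {
  const taskCache = {}; // tasks already downloaded by this client
  let ticket;
  while ((ticket = distributor.requestTicket()) !== null) {
    if (!(ticket.taskName in taskCache)) {
      taskCache[ticket.taskName] = distributor.requestTask(ticket.taskName);
    }
    const id = ticket.id;
    // Step 5: execute the task; the callback performs Step 6.
    taskCache[ticket.taskName](ticket.input, function (result) {
      distributor.returnResult(id, result);
    });
  }
}

const distributor = new MockDistributor(
  { SquareTask: function (input, output) { output({ value: input.x * input.x }); } },
  [{ id: 1, taskName: 'SquareTask', input: { x: 3 } },
   { id: 2, taskName: 'SquareTask', input: { x: 4 } }]
);
runClient(distributor);
console.log(distributor.results);
```

The task signature (input, output) mirrors the callback style of the sample task in the Appendix.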

The task and external data are cached in the browser. If a program runs for
a long time, memory usage increases due to the cache. Therefore, we have
implemented garbage collection based on the least recently used (LRU)
algorithm. If an error occurs while the task is running, an error report
that includes a stack trace is sent to the TicketDistributor, and the
browser then reloads itself. Thus, once the user accesses the program, the
tasks described by the tickets generated by the CalculationFramework are
executed continuously without special maintenance.
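
A least recently used cache of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the LruCache class is hypothetical, and it limits the cache by the number of entries, whereas the paper does not state how Sashimi measures cache size.

```javascript
// Minimal least-recently-used cache. A Map preserves insertion order,
// so its first key is always the least recently used entry.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity; // maximum number of entries (assumed limit)
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);     // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Evict the least recently used entry.
      this.map.delete(this.map.keys().next().value);
    }
  }
}

const cache = new LruCache(2);
cache.set('taskA', 'code A');
cache.set('datasetB', 'data B');
cache.get('taskA');           // taskA becomes the most recently used entry
cache.set('taskC', 'code C'); // exceeds the capacity and evicts datasetB
```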

Users can check the progress of tasks and tickets via the HTTPServer
control console. In this console, users can see the project name, the
number of tasks, the number of tickets waiting to be processed, the number
of executed tickets, the number of error reports, and the client
information for the project. A console that can be used to execute code in
the connected web browsers is also provided. With this console, the user
can make the browsers reload themselves or redirect them to another
distributed system.

We use responsive web design (RWD) techniques in the user interface (UI) of
the basic program and the control console. RWD techniques adapt the UI to
the screen size of any PC, tablet, or smartphone, which makes it easy to
use any device for distributed calculation and for checking the progress.
The tasks and tickets generated by the CalculationFramework are distributed
to browsers via WebSocket by the TicketDistributor. The processed results
are also collected by the TicketDistributor. Unlike HTTPServer, the
TicketDistributor runs in a single process and communicates with every web
browser in a unified and efficient manner.

When the TicketDistributor receives a ticket request from the browser, it
obtains tickets in ascending order of "virtual created time" from the MySQL
server. Virtual created time is determined as follows.

• For a ticket that has not been distributed, the virtual created time is
the ticket creation time.

• For a ticket that has been distributed, the virtual created time is five
minutes after the ticket distribution.

• For a ticket that has been redistributed, the virtual created time is
five minutes after the last ticket distribution.

Thus, if the result is not returned within five minutes, the ticket is
treated as if it had been re-created. Note that, if there are no further
tickets to be distributed, tickets are redistributed in ascending order of
distribution time. Thus, if a web browser is terminated after it receives a
ticket, and/or if there are clients with low computational capability,
another client can execute the task, which improves throughput. Tickets are
redistributed at intervals of at least 10 seconds, which prevents the last
ticket from being distributed to many clients and prevents the next
calculation from being delayed. We implemented this algorithm in SQL, which
can quickly select the tickets to be distributed.
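
The ordering rule can be illustrated in JavaScript, although the actual implementation is written in SQL. The sketch below makes several assumptions: the field names createdAt and lastDistributedAt are hypothetical, finished tickets are assumed to have been removed from the list, and the 10-second interval is interpreted as applying to each ticket individually.

```javascript
const REDISTRIBUTION_DELAY_MS = 5 * 60 * 1000;    // five minutes, as in the text
const MIN_REDISTRIBUTION_INTERVAL_MS = 10 * 1000; // at least 10 seconds

// ticket: { id, createdAt, lastDistributedAt }, times in milliseconds;
// lastDistributedAt is null while the ticket has never been distributed.
function virtualCreatedTime(ticket) {
  if (ticket.lastDistributedAt === null) return ticket.createdAt;
  return ticket.lastDistributedAt + REDISTRIBUTION_DELAY_MS;
}

// Select the ticket with the smallest virtual created time, skipping
// tickets that were distributed less than 10 seconds ago.
function selectTicket(tickets, now) {
  const eligible = tickets.filter(function (t) {
    return t.lastDistributedAt === null ||
           now - t.lastDistributedAt >= MIN_REDISTRIBUTION_INTERVAL_MS;
  });
  eligible.sort(function (a, b) {
    return virtualCreatedTime(a) - virtualCreatedTime(b);
  });
  return eligible.length > 0 ? eligible[0] : null;
}

const now = 1000000;
const tickets = [
  { id: 'A', createdAt: 0, lastDistributedAt: 990000 }, // distributed 10 s ago
  { id: 'B', createdAt: 500, lastDistributedAt: null }, // never distributed
  { id: 'C', createdAt: 100, lastDistributedAt: 600000 } // overdue by 100 s
];
const first = selectTicket(tickets, now); // the undistributed ticket B
```

Because an undistributed ticket always has a virtual created time in the past, it is selected before any ticket that is still within its five-minute window, and overdue tickets compete with it purely by time.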

2.2. Benchmark 

2.2.1. Experiment Condition 

Using Sashimi, we demonstrate that a task with high computational cost can
be computed in parallel efficiently. Here, we compare the time required to
classify the MNIST dataset with a nearest neighbor method while changing
the number of clients. In this experiment, 1,000 images from the 10,000
MNIST test images were classified by comparing them with the 60,000
training images. We used one to four clients on the desktop computer and
the tablet PCs described in Table 1. We accessed the Distributor using the
Google Chrome web browser in both the desktop and tablet environments.
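
A nearest neighbor classifier of the kind used in this benchmark can be sketched as follows. The paper does not state the distance measure, so the squared Euclidean distance is an assumption, and the function names and the toy data (standing in for MNIST pixel vectors) are hypothetical.

```javascript
// Squared Euclidean distance between two equally long numeric arrays.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Label of the training sample that is closest to the query.
function nearestNeighborLabel(query, trainVectors, trainLabels) {
  let best = -1;
  let bestDist = Infinity;
  for (let i = 0; i < trainVectors.length; i++) {
    const dist = squaredDistance(query, trainVectors[i]);
    if (dist < bestDist) {
      bestDist = dist;
      best = i;
    }
  }
  return trainLabels[best];
}

// Toy data standing in for MNIST pixel vectors and digit labels.
const trainVectors = [[0, 0], [10, 10], [0, 10]];
const trainLabels = [0, 1, 2];
const predicted = nearestNeighborLabel([9, 8], trainVectors, trainLabels);
```

Each test image is classified independently of the others, which is why the workload splits naturally into one ticket per image or per group of images.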

2.2.2. Results 

The results are shown in Table 2. In both environments, the calculation
time was reduced by the distributed computation. The effect was more
pronounced with the tablet PCs (an elapsed time ratio of 0.33 with four
clients, compared with 0.43 on the desktop computer) because the tablet has
lower computational power than the desktop computer, so the overhead time
required for the distribution becomes relatively smaller. We believe that
the proposed distributed computing method will become more effective for
other methods with high computational costs, such as SIFT feature
extraction and deep learning.

Table 1. Specifications of Devices for Distributed MNIST Benchmark

         DELL OPTIPLEX                Nexus 7
Model    DELL OPTIPLEX 8010           Nexus 7 (2013)
OS       Windows 7 Professional       Android 4.4.4
CPU      Intel Core i7-3770 3.4GHz    Krait 1.50GHz
RAM      16GB                         2GB
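
The elapsed times in Table 2 can be converted into speedup and parallel efficiency, which makes the difference between the two environments explicit. The following sketch uses the measured times from Table 2; the speedups function and the definitions speedup = T(1) / T(n) and efficiency = speedup / n are standard conventions introduced here for illustration.

```javascript
// Elapsed times in seconds from Table 2 for one to four clients.
const elapsed = {
  'DELL OPTIPLEX': [107, 62, 52, 46],
  'Nexus 7': [768, 413, 293, 255]
};

// Speedup = T(1) / T(n); parallel efficiency = speedup / n.
function speedups(times) {
  return times.map(function (t, i) {
    const speedup = times[0] / t;
    return { clients: i + 1, speedup: speedup, efficiency: speedup / (i + 1) };
  });
}

const desktop = speedups(elapsed['DELL OPTIPLEX']);
const tablet = speedups(elapsed['Nexus 7']);
console.log(desktop[3].speedup.toFixed(2), tablet[3].speedup.toFixed(2));
```

With four clients, the speedup is about 2.3 on the desktop computer and about 3.0 on the tablets.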

3. Sukiyaki: Deep Neural Network Framework

We implemented a learning algorithm for deep neural networks (DNNs) with
browsers based on Sashimi. Here, we explain the proposed framework and
implementation for DNNs. We also discuss the advantages of the proposed
method over an existing library in a stand-alone environment. The
distributed computation of DNNs is explained in the next section.

We primarily implemented deep convolutional neural networks (deep CNNs),
which obtain high classification accuracy in image recognition tasks.
ConvNetJS (Karpathy) is an existing NN library implemented in JavaScript.
However, its computational speed is limited because it runs on only a
single thread. Therefore, we developed a deep neural network framework
called "Sukiyaki" that utilizes a fast matrix library called Sushi (Miura
et al., 2015). The Sushi matrix library is fast because it is implemented
on WebCL and can utilize general purpose GPUs (GPGPUs) efficiently.

3.1. Implementation 

The Sukiyaki DNN framework consists of the Sukiyaki object, which handles
procedures for learning and testing in the neural network, and layer
objects. In this version, for deep CNNs, we implemented a convolutional
layer, a max pooling layer, a fully-connected layer, and an activation
layer. Note that we can add other layers if we implement certain methods
such as forward, backward, and update. The forward, backward, and update
methods in each layer are implemented using the Sushi matrix library; thus,
they can be executed in parallel on GPGPUs.
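
To illustrate what a layer with forward, backward, and update methods might look like, the following sketch implements an activation layer on plain JavaScript arrays. It is not Sukiyaki's actual API: the ReluLayer name and the method signatures are hypothetical, the choice of ReLU is an assumption because the paper does not name the activation function, and the real layers operate on Sushi matrices rather than arrays.

```javascript
// Hypothetical ReLU activation layer operating on plain arrays.
function ReluLayer() {
  this.lastInput = null;
}

// forward: y = max(0, x); the input is remembered for the backward pass.
ReluLayer.prototype.forward = function (input) {
  this.lastInput = input;
  return input.map(function (x) { return Math.max(0, x); });
};

// backward: the gradient passes through only where the input was positive.
ReluLayer.prototype.backward = function (outputGradient) {
  const input = this.lastInput;
  return outputGradient.map(function (g, i) { return input[i] > 0 ? g : 0; });
};

// update: an activation layer has no parameters, so there is nothing to do.
ReluLayer.prototype.update = function () {};

const layer = new ReluLayer();
const activated = layer.forward([-1, 2, 0.5]);
const inputGradient = layer.backward([1, 1, 1]);
```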

Table 2. Results of Distributed MNIST Benchmark

Environment      Clients   Elapsed Time (sec.)   Elapsed Time (ratio)
DELL OPTIPLEX    1         107                   1
                 2         62                    0.58
                 3         52                    0.49
                 4         46                    0.43
Nexus 7          1         768                   1
                 2         413                   0.54
                 3         293                   0.38
                 4         255                   0.33

We can use AdaGrad (Duchi et al., 2011) as an online parameter learning
method, which can learn parameters quickly. The original update rule in
AdaGrad is as follows.

$$\theta_{i,t} = \theta_{i,t-1} - \frac{\alpha}{\sqrt{\sum_{u=0}^{t} g_{i,u}^{2}}} \, g_{i,t}$$

where $\alpha$ is a scalar learning rate, $\theta_{i,t}$ is the i-th
element of the parameter at time step t, and $g_{i,t}$ is the i-th element
of the gradient at time step t. However, with this update rule, learning
often becomes unstable because the sum of squared gradients is minuscule
early in the learning process, which makes the effective step size very
large. Therefore, we have modified the update rule using a constant
$\beta$ as follows.

$$\theta_{i,t} = \theta_{i,t-1} - \frac{\alpha}{\sqrt{\beta + \sum_{u=0}^{t} g_{i,u}^{2}}} \, g_{i,t}$$
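
The modified update rule can be written as a short function. This is a sketch on plain arrays rather than Sukiyaki's implementation; the adagradStep name and the values of the learning rate and the constant (0.1 and 16) are illustrative assumptions.

```javascript
// One AdaGrad step with the stabilizing constant beta.
// theta, grad, and sumSq are equally long arrays; sumSq accumulates the
// squared gradients across calls and is updated in place.
function adagradStep(theta, grad, sumSq, alpha, beta) {
  return theta.map(function (value, i) {
    sumSq[i] += grad[i] * grad[i];
    return value - (alpha / Math.sqrt(beta + sumSq[i])) * grad[i];
  });
}

const sumSq = [0, 0];
let theta = [1.0, -2.0];
// First element: 1 - 0.1 / sqrt(16 + 9) * 3; second element: unchanged.
theta = adagradStep(theta, [3, 0], sumSq, 0.1, 16);
```

A positive constant also keeps the denominator away from zero: without it, a parameter whose gradients have all been zero would produce a division of zero by zero.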

We designed the Sukiyaki DNN framework to be used with both Node.js and
browsers so that DNNs can be trained in a distributed manner using the
Sashimi distributed calculation framework. For example, a model file is
formatted in JSON, with the parameters encoded in base64. Because base64
preserves the binary representation of the parameters, the model file can
be exchanged among machines without rounding errors even though it is a
platform-independent string format.
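
The following sketch shows how parameters can survive a round trip through a base64 field in JSON without rounding errors. It is not Sukiyaki's model format: the encodeModel and decodeModel functions and the params field are hypothetical, 32-bit floats are an assumption, the code uses the Node.js Buffer API (a browser would need a different base64 routine), and both machines are assumed to share the same byte order.

```javascript
// Encode a Float32Array as a base64 string inside a JSON document.
function encodeModel(params) {
  const bytes = Buffer.from(params.buffer, params.byteOffset, params.byteLength);
  return JSON.stringify({ params: bytes.toString('base64') });
}

// Decode the JSON document back into a Float32Array with identical bits.
function decodeModel(json) {
  const bytes = Buffer.from(JSON.parse(json).params, 'base64');
  const params = new Float32Array(bytes.length / 4);
  new Uint8Array(params.buffer).set(bytes); // copying avoids alignment issues
  return params;
}

const original = new Float32Array([0.1, -1.5, 1 / 3]);
const restored = decodeModel(encodeModel(original));
```

Writing the numbers as decimal text would be portable as well, but the stored values would then depend on how each platform formats and parses floating-point numbers.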

3.2. Benchmark 

3.2.1. Experiment Condition 

We compared the learning speed of the Sukiyaki DNN 
framework with that of ConvNetJS, which is an existing 
JavaScript NN library. 

In this experiment, we used the deep CNN model shown in Figure 2.
Mini-batches of 50 images were learned from the 50,000 training images in
CIFAR-10 (Krizhevsky, 2009). Note that CIFAR-10 consists of 24-bit 32 x 32
color images in 10 classes. The model convolves its input with 5 x 5
kernels in each convolutional layer and generates three feature maps of
size 32 x 32 x 16, 16 x 16 x 20, and 8 x 8 x 20. Each convolutional layer
is followed by an activation layer and a max pooling layer, which halves
the width and height of the output. The fourth layer is a fully-connected
layer that converts 320 input elements to 10 output elements, from which
class probabilities are estimated via the softmax function.

Table 3. Specifications of Device for Neural Network Libraries Benchmark

Model    MacBook Pro (Retina, 13-inch, Late 2013)
OS       Mac OS X Yosemite (10.10.1)
CPU      Intel Core i5-4258U 2.4GHz
GPU      Intel Iris
RAM      8GB

Table 4. Number of Batches Learned per Minute

               ConvNetJS             Sukiyaki
               Node.js    Firefox    Node.js    Firefox
Batches/min.   17.55      2.44       545.39     31.39
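
The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes that each convolution preserves the spatial size (that is, the input is padded) and that each pooling step halves the width and height; the stageShapes function is hypothetical.

```javascript
// Each stage: a 5 x 5 convolution that keeps the spatial size (assumed
// padding), followed by max pooling that halves the width and height.
function stageShapes(inputSize, channelsPerStage) {
  let size = inputSize;
  return channelsPerStage.map(function (channels) {
    const conv = [size, size, channels];
    size = size / 2;
    return { conv: conv, pooled: [size, size, channels] };
  });
}

const stages = stageShapes(32, [16, 20, 20]);
const last = stages[stages.length - 1].pooled;
const fullyConnectedInputs = last[0] * last[1] * last[2];
console.log(fullyConnectedInputs);
```

The last pooled feature map has size 4 x 4 x 20, which gives the 320 input elements of the fully-connected layer.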

We trained the network using Node.js and the Firefox web browser on a
MacBook Pro (specifications described in Table 3) and compared the learning
speeds.

3.2.2. Results 

The results are shown in Table 4 and Figure 3. Sukiyaki learned the network
faster than ConvNetJS on both Node.js and Firefox, and it also converged
faster. Note that Sukiyaki with Node.js learned the network about 30 times
faster than ConvNetJS with Node.js (545.39 versus 17.55 batches per
minute).

4. Distributed Deep Learning 

We ran the Sukiyaki DNN framework on the Sashimi 
distributed calculation framework and realized distributed 
learning of deep CNNs. 

4.1. Distribution Algorithms 

Some previous studies have proposed methods for distributed learning of
DNNs and deep CNNs.

Dean et al. (Dean et al., 2012) proposed DistBelief, which is an efficient
distributed computing method for DNNs. In DistBelief, a network is
partitioned into subnetworks, and different machines are responsible for
the computation of different subnetworks. The nodes with edges that cross
partition boundaries must share their state information between machines.
However, since we consider machines connected via the Internet with low
throughput, it is difficult to share the nodes' state between different
machines with the proposed framework. DistBelief focuses on
fully-connected networks; thus, it is also difficult to directly apply this
approach to convolutional networks that share weights among different
nodes.

Meeds et al. (Meeds et al., 2014) developed MLitB, wherein different
training data batches are assigned to different clients. The clients
compute gradients and send them to the master, which computes a weighted
average of the gradients from all clients and updates the network. The new
network weights are sent to the clients, and the clients then resume
computing gradients on the basis of the new weights. This approach is
simple and easy to implement; however, all network weights and gradients
must be communicated between the master and the clients. Thus, the
communication overhead becomes excessively large with a large network.
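
The master-side step of this kind of data parallelism can be sketched as follows. This is an illustration of weighted gradient averaging in general, not MLitB's code: the function names are hypothetical, the weights are assumed to be the number of samples each client processed, and a plain gradient step stands in for the actual update rule.

```javascript
// Weighted average of client gradients. The weight of each client is the
// number of training samples it processed (an assumption of this sketch).
function averageGradients(clientResults) {
  const total = clientResults.reduce(function (n, r) { return n + r.samples; }, 0);
  const average = new Array(clientResults[0].gradient.length).fill(0);
  clientResults.forEach(function (r) {
    r.gradient.forEach(function (g, i) {
      average[i] += (r.samples / total) * g;
    });
  });
  return average;
}

// Plain gradient step on the master with learning rate alpha.
function updateWeights(weights, gradient, alpha) {
  return weights.map(function (w, i) { return w - alpha * gradient[i]; });
}

const averaged = averageGradients([
  { samples: 30, gradient: [1, 0] },
  { samples: 10, gradient: [5, 4] }
]);
const newWeights = updateWeights([1, 1], averaged, 0.5);
```

Every round transfers one gradient vector per client to the master and one weight vector back to each client, so the traffic grows with both the network size and the number of clients.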

Krizhevsky (Krizhevsky, 2014) proposed a method to parallelize the training
of deep CNNs efficiently using both model parallelism and data parallelism.
Generally, deep CNNs consist of many convolutional layers and a few
fully-connected layers. Due to weight sharing, convolutional layers incur
significant computational cost relative to their small number of
parameters. In contrast, fully-connected layers have many more parameters
than convolutional layers and less computational complexity. Krizhevsky
therefore applied data parallelism in the convolutional layers and model
parallelism in the fully-connected layers. However, we focus on distributed
computation via the Internet; thus, we must reduce communication costs
further in the proposed framework.

He et al. (He et al., 2015) implemented another effective method for
distributed deep learning. They parallelize the training of convolutional
layers using data parallelism on multiple GPUs. Then, the GPUs are
synchronized, and the fully-connected layers are trained on a single GPU.
The computational complexity of training fully-connected layers is
relatively small; thus, model parallelism of fully-connected layers does
not necessarily contribute to fast learning. This method of combining
parallelized and stand-alone learning works efficiently and is easy to
implement. However, it leaves some computational resources idle while the
fully-connected layers are trained on a single GPU, so there is still room
for improvement.

SINGA (Wang et al.) is a distributed deep learning platform. It supports
both model partitioning and data partitioning, which are managed
automatically through a distributed array data structure so that the user
does not need to be aware of the array partition. While SINGA is designed
to accelerate deep learning by using MPI, it is unclear whether this
approach is appropriate for distributed calculation via the Internet.

In this study, we have developed a new method that parallelizes the
training of convolutional layers using data parallelism and does not apply
model parallelism to fully-connected layers. The proposed method trains the
fully-connected layers on the server while the clients train the
convolutional layers. Unlike the method of He et al. (He et al., 2015),
convolutional layers and fully-connected layers are learned concurrently.
This method reduces the communication cost among machines and utilizes the
computational capability of the server while it awaits responses from the
clients.

Figure 5. Learning Speed by Distributed Deep Learning

4.2. Benchmark 

4.2.1. Experiment Conditions 

In this experiment, we parallelized the training of the deep CNN shown in
Figure 4 using the Sukiyaki DNN framework and the Sashimi distributed
calculation framework. We used the computers shown in Table 5 as the server
and the client machine. The client machine had four GPU cores; thus, we ran
one to four Firefox web browsers on the client machine. The browsers
accessed the server running Node.js and began training the network in
parallel. We compared the training speeds of both the convolutional and
fully-connected layers while varying the number of clients.

4.2.2. Results 

The results are shown in Figure 5. The proposed distributed computation
method trains the fully-connected layers 1.5 times faster than the
stand-alone computation method, independently of the number of clients,
because the server can be devoted to training the fully-connected layers.
The training speed of the convolutional layers increases in proportion to
the number of clients. With four clients training the convolutional layers,
the proposed method is twice as fast as the stand-alone method.

5. Conclusion and Future Plans 

In this paper, we have proposed a distributed calculation framework using
JavaScript to address the increasing need for the computational resources
required to process big data. We have developed the Sashimi distributed
calculation framework, which allows any web browser to function as a
computation node. When Sashimi is applied to an image classification
problem, the task is executed in parallel, and its calculation speed
increases with the number of clients. Furthermore, we developed the
Sukiyaki DNN framework, which utilizes GPGPUs and can train deep CNNs 30
times faster than a conventional JavaScript DNN library. By building
Sukiyaki on Sashimi, we also developed a parallel computing method for deep
CNNs that is suitable for distributed computing via the Internet. We have
shown that deep CNNs can be trained in parallel using only web browsers.

These libraries are available under the MIT license at
http://mil-tokyo.github.io/. In the future, we plan to improve the
efficiency of the distribution algorithm by considering clients'
computational capabilities and to support other types of networks in
Sukiyaki. We welcome suggestions for improvements to our code and
documentation. It is our hope that many programmers will further develop
Sukiyaki and Sashimi into a high-performance distributed computing platform
that anyone can use easily.
Table 5. Specifications of Devices for Distributed Deep Learning

         Server                         Client
         Mac Pro                        DELL Alienware
Model    Mac Pro (Late 2013)            DELL Alienware Area-51
OS                                      Windows 8.1
CPU      Intel Xeon E5 3.5GHz 6-core    Intel Core i7-5960X 3.00GHz 8-core
GPU      AMD FirePro D500               NVIDIA GeForce GTX TITAN Z x 2
RAM      32GB                           32GB
Appendix. Sashimi Sample Program


Source Code 1. prime_list_maker_project.js


var ProjectBase = require('../project_base');
var IsPrimeTask = require('./is_prime_task');

var PrimeListMakerProject = function () {
    this.name = 'PrimeListMakerProject';

    // Entry point of the project.
    this.run = function () {
        var task = this.createTask(IsPrimeTask);
        var inputs = [];
        for (var i = 1; i <= 10000; i++) {
            inputs.push({ candidate : i });
        }
        // Divide the inputs into tickets and distribute them.
        task.calculate(inputs);
        // The callback receives the collected results.
        task.block(function (results) {
            for (var i = 0; i < results.length; i++) {
                if (results[i].output.is_prime) {
                    // Assumes results are in input order, so results[i]
                    // belongs to inputs[i] (candidate i + 1, not i).
                    console.log(inputs[i].candidate + ' is a prime number.');
                }
            }
        });
    };
};

PrimeListMakerProject.prototype = new ProjectBase();


Source Code 2. is_prime_task.js


var TaskBase = require('../task_base');

var IsPrimeTask = function () {
    // External library that provides the is_prime function.
    this.static_code_files = ['is_prime'];

    // Called for each ticket; the result is returned via the output callback.
    this.run = function (input, output) {
        if (is_prime(input.candidate)) {
            output({ is_prime : true });
        } else {
            output({ is_prime : false });
        }
    };
};

IsPrimeTask.prototype = new TaskBase();


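Source Code 3. is_prime.js

function is_prime(candidate) {
    // Integers below 2 are not prime numbers.
    if (candidate < 2) {
        return false;
    }
    // Trial division up to the square root of the candidate.
    for (var i = 2; i <= Math.sqrt(candidate); i++) {
        if (candidate % i === 0) {
            return false;
        }
    }
    return true;
}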